1. INTRODUCTION

There are many fascinating issues discussed in 
this paper. Several concern parapsychology itself 
and the interpretation of statistical methodology 
therein. We are not experts in parapsychology, and 
so have only one comment concerning such mat- 
ters: In Section 3 we briefly discuss the need to 
switch from P-values to Bayes factors in discussing 
evidence concerning parapsychology. 

A more general issue raised in the paper is that 
of replication. It is quite illuminating to consider 
the issue of replication from a Bayesian perspec- 
tive, and this is done in Section 2 of our discussion. 

2. REPLICATION 

Many insightful observations concerning replica- 
tion are given in the article, and these spurred us 
to determine if they could be quantified within 
Bayesian reasoning. Quantification requires clear 
delineation of the possible purposes of replication, 
and at least two are obvious. The first is simple 
reduction of random error, achieved by obtaining 
more observations from the replication. The second 
purpose is to search for possible bias in the original 
experiment. We use "bias" in a loose sense here, to 
refer to any of the huge number of ways in which 
the effects being measured by the experiment can 
differ from the actual effects of interest. Thus a 
clinical trial without a placebo can suffer a placebo 
"bias"; a survey can suffer a "bias" due to the 
sampling frame being unrepresentative of the 
actual population; and possible sources of bias 
in parapsychological experiments have been 
extensively discussed. 

Replication to Reduce Random Error 

If the sole goal of replication of an experiment is 
to reduce random error, matters are very straight- 
forward. Reviewing the Bayesian way of studying 
this issue is, however, useful and will be done 
through the following simple example. 

Example 1. Consider the example from Tversky
and Kahneman (1982), in which an experiment
results in a standardized test statistic of $z_1 = 2.46$.
(We will assume normality to keep computations
trivial.) The question is: What is the highest value
of $z_2$ in a second set of data that would be
considered a failure to replicate? Two possible precise
versions of this question are: Question 1: What is
the probability of observing $z_2$ for which the null
hypothesis would be rejected in the replicated
experiment? Question 2: What value of $z_2$ would
leave one's overall opinion about the null
hypothesis unchanged?

Consider the simple case where $Z_1 \sim N(z_1 \mid \theta, 1)$
and (independently) $Z_2 \sim N(z_2 \mid \theta, 1)$, where $\theta$ is
the mean and 1 is the standard deviation of the
normal distribution. Note that we are considering
the case in which no experimental bias is suspected
and so the means for each experiment are assumed
to be the same.

Suppose that it is desired to test $H_0: \theta \le 0$ versus
$H_1: \theta > 0$, and suppose that initial prior opinion
about $\theta$ can be described by the noninformative
prior $\pi(\theta) = 1$. We consider the one-sided testing
problem with a constant prior in this section,
because it is known that then the posterior
probability of $H_0$, to be denoted by $P(H_0 \mid \text{data})$, equals the
P-value, allowing us to avoid complications arising
from differences between Bayesian and classical
answers.

After observing $z_1 = 2.46$, the posterior
distribution of $\theta$ is

\[ \pi(\theta \mid z_1) = N(\theta \mid 2.46, 1). \]

Question 1 then has the answer (using predictive
Bayesian reasoning)

\[
P(\text{rejecting at level } \alpha \mid z_1)
  = \int_{c_\alpha}^{\infty} \int_{-\infty}^{\infty}
    \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}(z_2 - \theta)^2} \,
    \pi(\theta \mid z_1) \, d\theta \, dz_2
  = 1 - \Phi\!\left( \frac{c_\alpha - 2.46}{\sqrt{2}} \right),
\]

where $\Phi$ is the standard normal cdf and $c_\alpha$ is the
(one-sided) critical value corresponding to the level,
$\alpha$, of the test. For instance, if $\alpha = 0.05$, then this
probability equals 0.7178, demonstrating that there
is a quite substantial probability (about 0.28) that the second
experiment will fail to reject. If $\alpha$ is chosen to be
the observed significance level from the first
experiment, so that $c_\alpha = z_1$, then the probability that the
second experiment will reject is just 1/2. This is
nothing but a statement of the well-known
martingale property of Bayesianism, that what you
"expect" to see in the future is just what you know
today. In a sense, therefore, question 1 is exposed
as being uninteresting.
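
The predictive probability for Question 1 can be checked numerically. The following is a minimal sketch, assuming only Python's standard-library `statistics.NormalDist`; the function name `prob_replication_rejects` is hypothetical and introduced purely for illustration. It uses the fact that, under the flat prior, $Z_2$ given $z_1$ is predictively normal with mean $z_1$ and standard deviation $\sqrt{2}$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_replication_rejects(z1: float, alpha: float) -> float:
    """Predictive probability that Z2 exceeds the one-sided level-alpha cutoff."""
    c_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # one-sided critical value
    # Given z1 and a flat prior on theta, Z2 is predictively N(z1, sqrt(2)).
    return 1.0 - STD_NORMAL.cdf((c_alpha - z1) / sqrt(2.0))


# alpha = 0.05, as in the text.
p_05 = prob_replication_rejects(2.46, 0.05)

# alpha equal to the first experiment's own P-value, so that c_alpha = z1.
p_same = prob_replication_rejects(2.46, 1.0 - STD_NORMAL.cdf(2.46))

print(p_05, p_same)
```

The second call illustrates the martingale remark: when the cutoff is the observed $z_1$ itself, the predictive distribution is centered exactly at the cutoff.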

Question 2 more properly focuses on the fact that
the stated goal of replication here is simply to
reduce uncertainty in stated conclusions. The
answer to the question follows immediately from
noting that the posterior from the combined data
$(z_1, z_2)$ is

\[ \pi(\theta \mid z_1, z_2) = N\!\left(\theta \,\Big|\, \frac{z_1 + z_2}{2}, \frac{1}{\sqrt{2}}\right), \]

so that

\[ P(H_0 \mid \text{data}) = \Phi\!\left( -\frac{z_1 + z_2}{\sqrt{2}} \right). \]

Setting this equal to $P(H_0 \mid z_1) = \Phi(-z_1)$ and solving for $z_2$
yields $z_2 = (\sqrt{2} - 1) z_1 = 1.02$. Any value of $z_2$
greater than this will increase the total evidence
against $H_0$, while any value smaller than 1.02 will
decrease the evidence.
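
The break-even value for Question 2 can be verified the same way. This is a small sketch under the same flat-prior, unit-variance assumptions; `posterior_prob_null` and `break_even_z2` are hypothetical helper names. The first helper is written for any number of independent z statistics, which is a mild generalization not used in the text; for two statistics it reduces to $\Phi(-(z_1 + z_2)/\sqrt{2})$.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf


def posterior_prob_null(z_values):
    """P(theta <= 0 | data) under a flat prior: Phi(-sum(z) / sqrt(n))."""
    n = len(z_values)
    return PHI(-sum(z_values) / sqrt(n))


def break_even_z2(z1):
    """Value of z2 that leaves P(H0 | data) equal to P(H0 | z1)."""
    return (sqrt(2.0) - 1.0) * z1


z1 = 2.46
z2_star = break_even_z2(z1)

print(z2_star)
print(posterior_prob_null([z1]), posterior_prob_null([z1, z2_star]))
```

A replication with $z_2$ above `z2_star` lowers the posterior probability of $H_0$, and one below it raises that probability, even though such a $z_2$ may itself be positive.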

Replication to Detect Bias 

The aspirin example dramatically raises the is- 
sue of bias detection as a motive for replication. 
Professor Utts observes that replication 1 gives 
results that are fully compatible with those of the 
original study, which could be interpreted as sug- 
gesting that there is no bias in the original study, 
while replication 2 would raise serious concerns of 
bias. We became very interested in the implicit
suggestion that replication 2 would thus lead to 
less overall evidence against the null hypothesis 
than would replication 1, even though in isolation 
replication 2 was much more "significant" than 
was replication 1. In attempting to see if this is so, 
we considered the Bayesian approach to study of 
bias within the framework of the aspirin example. 

Example 2. For simplicity in the aspirin
example, we reduce consideration to

$\theta$ = true difference in heart attack rates between
aspirin and placebo populations multiplied by
1000;

$Y$ = difference in observed heart attack rates
between aspirin and placebo groups in original
study multiplied by 1000;

$X_i$ = difference in observed heart attack rates
between aspirin and placebo groups in
replication $i$ multiplied by 1000.

We assume that the replication studies are
extremely well designed and implemented, so that
one is very confident that the $X_i$ have mean $\theta$.
Using normal approximations for convenience, the
data can be summarized as

\[ X_1 \sim N(x_1 \mid \theta, 4.82), \qquad X_2 \sim N(x_2 \mid \theta, 3.63), \]

with actual observations $x_1 = 7.704$ and $x_2 = 13.07$.

Consider now the bias issue. We assume that the
original experiment is somewhat suspect in this
regard, and we will model bias by defining the
mean of $Y$ to be

\[ \eta = \theta + \beta, \]

where $\beta$ is the unknown bias. Then the data in the
original experiment can be summarized by

\[ Y \sim N(y \mid \eta, 1.54), \]

with the actual observation being $y = 7.707$.

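Bayesian analysis requires specification of a prior
distribution, $\pi(\beta)$, for the suspected amount of bias.
Of particular interest then are the posterior
distribution of $\beta$, assuming replication $i$ has been
performed, given by

\[
\pi(\beta \mid y, x_i) \propto \pi(\beta)
  \exp\left\{ -\frac{1}{2(1.54^2 + \sigma_i^2)} \, [\beta - (y - x_i)]^2 \right\},
\]

where $\sigma_i^2$ is the variance ($4.82^2$ or $3.63^2$) from
replication $i$; and the posterior probability of $H_0$, given
by

\[
P(H_0 \mid y, x_i) = \int_{-\infty}^{\infty}
  \Phi\!\left( -\left[ \frac{\sigma_i \, (y - \beta)}{1.54 \sqrt{\sigma_i^2 + 1.54^2}}
  + \frac{1.54 \, x_i}{\sigma_i \sqrt{\sigma_i^2 + 1.54^2}} \right] \right)
  \pi(\beta \mid y, x_i) \, d\beta.
\]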

Recall that our goal here was to see if Bayesian
analysis can reproduce the intuition that the
original experiment could be trusted if replication 1 had
been done, while it could not be trusted (in spite of
its much larger sample size) had replication 2 been
performed. Establishing this requires finding a
prior distribution $\pi(\beta)$ for which $\pi(\beta \mid y, x_1)$ has
little effect on $P(H_0 \mid y, x_1)$, but $\pi(\beta \mid y, x_2)$ has a
large effect on $P(H_0 \mid y, x_2)$. To achieve the first
objective, $\pi(\beta)$ must be tightly concentrated near
zero. To achieve the second, $\pi(\beta)$ must be such that
large $|y - x_2|$, which suggests presence of a large
bias, can result in a substantial shift of posterior
mass for $\beta$ away from zero.

A sensible candidate for the prior density $\pi(\beta)$
is the Cauchy$(0, V)$ density

\[ \pi(\beta) = \frac{1}{\pi V \, [1 + (\beta / V)^2]}. \]

Flat-tailed densities, such as this, are well known
to have the property that when discordant data are
observed (e.g., when $|y - x_2|$ is large),
substantial mass shifts away from the prior center towards
the likelihood center. It is easy to see that a normal
prior for $\beta$ cannot have the desired behavior.

Our first surprise in consideration of these priors
was how small $V$ needed to be chosen in order for
$P(H_0 \mid y, x_1)$ to be unaffected by the bias. For
instance, even with $V = 1.54/100$ (recall that 1.54
was the standard deviation of $Y$ from the original
experiment), computation yields $P(H_0 \mid y, x_1) = 4.3 \times 10^{-5}$,
compared with the P-value (and posterior
probability from the original experiment
assuming no bias) of $2.8 \times 10^{-7}$. There is a clear
lesson here: even very small suspicions of bias can
drastically alter a small P-value. Note that
replication 1 is very consistent with the presence of no
bias, and so the posterior distribution for the bias
remains tightly concentrated near zero; for
instance, the mean of the posterior for $\beta$ is then
$7.2 \times 10^{-6}$, and the standard deviation is 0.25.
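
Posterior summaries of this kind can be approximated with a simple quadrature. The sketch below is one possible way to do the computation and not necessarily the method used for the reported figures. It assumes a flat prior on $\theta$ (as in Example 1), a Cauchy$(0, V)$ prior on $\beta$, and $y = 7.707$, the value consistent with the reported P-value of $2.8 \times 10^{-7}$ since $7.707/1.54 \approx 5.0$. The name `bias_posterior_summary` and the substitution $\beta = V \tan u$ (which makes the Cauchy prior uniform in $u$) are choices made here for illustration.

```python
import math

SD_Y = 1.54    # standard deviation of Y in the original study
Y_OBS = 7.707  # observed y (assumed reading; 7.707 / 1.54 is about 5.0)


def bias_posterior_summary(x, sd_x, scale=SD_Y / 100, n=200_000):
    """Return (P(H0 | y, x), posterior mean of beta, posterior sd of beta).

    Midpoint rule in u, where beta = scale * tan(u) and u is uniform on
    (-pi/2, pi/2) under the Cauchy(0, scale) prior.
    """
    s2 = SD_Y ** 2 + sd_x ** 2        # variance of Y - X given beta
    root = math.sqrt(s2)
    a = sd_x / (SD_Y * root)          # coefficient of (y - beta)
    b = SD_Y / (sd_x * root)          # coefficient of x
    h = math.pi / n
    w_sum = p_sum = m1 = m2 = 0.0
    for k in range(n):
        u = -math.pi / 2 + (k + 0.5) * h
        beta = scale * math.tan(u)
        # Likelihood of beta based on y - x (the prior is constant in u).
        w = math.exp(-(beta - (Y_OBS - x)) ** 2 / (2 * s2))
        if w == 0.0:
            continue
        m = a * (Y_OBS - beta) + b * x
        p_sum += w * 0.5 * math.erfc(m / math.sqrt(2))  # Phi(-m)
        w_sum += w
        m1 += w * beta
        m2 += w * beta * beta
    mean = m1 / w_sum
    sd = math.sqrt(m2 / w_sum - mean ** 2)
    return p_sum / w_sum, mean, sd


# Replication 1: x_1 = 7.704 with standard deviation 4.82.
p_h0, beta_mean, beta_sd = bias_posterior_summary(7.704, 4.82)
print(p_h0, beta_mean, beta_sd)
```

The same function can be called with the observation and standard deviation of any other replication. The grid size `n` is an arbitrary choice; a finer grid or an adaptive integrator would be preferable if more digits are needed.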

When we turned attention to replication 2, we
found that it did not seriously change the prior
perceptions of bias. Examination quickly revealed
the reason: even the maximum likelihood estimate
of the bias is no more than 1.4 standard deviations
from zero, which is not enough to change strong
prior beliefs. We, therefore, considered a third
experiment, defined in Table 1. Transforming to
approximate normality, as before, yields

\[ X_3 \sim N(x_3 \mid \theta, 3.48), \]

with $x_3 = 22.72$ being the actual observation. The
maximum likelihood estimate of bias is now 3.95
standard deviations from zero, so there is potential
for a substantial change in opinion about the bias.
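
The arithmetic behind these statements can be sketched as follows. Two assumptions are made here that the text does not spell out: the normal approximation is taken to be the usual unpooled difference of two proportions (placebo rate minus aspirin rate, times 1000), and the maximum likelihood estimate of the bias is taken to be $y - x_i$, with variance $1.54^2 + \sigma_i^2$ and $y = 7.707$. The function names are hypothetical.

```python
from math import sqrt


def rate_difference(events_a, n_a, events_b, n_b, scale=1000.0):
    """Scaled difference in event rates (group b minus group a) and its
    standard error, using the unpooled normal approximation."""
    pa, pb = events_a / n_a, events_b / n_b
    se = sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    return scale * (pb - pa), scale * se


def standardized_bias_mle(y, sd_y, x, sd_x):
    """(y - x) in units of its own standard deviation."""
    return (y - x) / sqrt(sd_y ** 2 + sd_x ** 2)


# Table 1 counts: aspirin 5 yes / 2309 no, placebo 54 yes / 2116 no.
x3, sd3 = rate_difference(5, 5 + 2309, 54, 54 + 2116)

z_bias_rep2 = standardized_bias_mle(7.707, 1.54, 13.07, 3.63)
z_bias_rep3 = standardized_bias_mle(7.707, 1.54, 22.72, 3.48)

print(x3, sd3)
print(z_bias_rep2, z_bias_rep3)
```

The sign of the standardized estimates is negative because each replication shows a larger effect than the original study; only the magnitudes matter for the comparison with 1.4 and 3.95.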

Sure enough, computation when $V = 1.54/100$
yields that $E[\beta \mid y, x_3] = -4.9$ with (posterior)
standard deviation equal to 6.62, which is a
dramatic shift from prior opinion (that $\beta$ is
Cauchy$(0, 1.54/100)$). The effect of this is to essentially ignore
the original experiment in overall assessments of
evidence. For instance, $P(H_0 \mid y, x_3) = 3.81 \times 10^{-11}$,
which is very close to $P(H_0 \mid x_3) = 3.29 \times 10^{-11}$.
Note that, if $\beta$ were set equal to zero, the
overall posterior probability of $H_0$ (and P-value)
would be $2.62 \times 10^{-13}$.

Table 1
Frequency of heart attacks in replication 3

             Yes      No
Aspirin        5    2309
Placebo       54    2116

Thus Bayesian reasoning can reproduce the
intuition that replication which indicates bias can cast
considerable doubt on the original experiment,
while replication which provides no evidence of
bias leaves evidence from the original experiment
intact. Such behavior seems only obtainable,
however, with flat-tailed priors for bias (such as the
Cauchy) that are very concentrated (in comparison
with the experimental standard deviation) near
zero.

3. P-VALUES OR BAYES FACTORS?

Parapsychology experiments usually consider
testing of $H_0$: No parapsychological effect exists.
Such null hypotheses are often realistically
represented as point nulls (see Berger and Delampady,
1987, for the reason that care must be taken in
such representation), in which case it is known that
there is a large difference between P-values and
posterior probabilities (see Berger and Delampady,
1987, for review). The article by Jefferys (1990)
dramatically illustrates this, showing that a very
small P-value can actually correspond to evidence
for $H_0$ when considered from a Bayesian
perspective. (This is closely related to the famous "Jeffreys"
paradox.) The argument in favor of the Bayesian
approach here is very strong, since it can be shown
that the conflict holds for virtually any sensible
prior distribution; a Bayesian answer can be wrong
if the prior information turns out to be inaccurate,
but a Bayesian answer that holds for all sensible
priors is unassailable.

Since P-values simply cannot be viewed as
meaningful in these situations, we found it of interest to
reconsider the example in Section 5 from a Bayes
factor perspective. We considered only analysis of
the overall totals, that is, $x = 122$ successes out of
$n = 355$ trials. Assuming a simple Bernoulli trial
model with success probability $\theta$, the goal is to test
$H_0: \theta = 1/4$ versus $H_1: \theta \ne 1/4$.

To determine the Bayes factor here, one must
specify $g(\theta)$, the conditional prior density on $H_1$.
Consider choosing $g$ to be uniform and symmetric,
that is,

\[
g_r(\theta) =
\begin{cases}
  \dfrac{1}{2r}, & \text{for } \dfrac{1}{4} - r \le \theta \le \dfrac{1}{4} + r, \\
  0, & \text{otherwise.}
\end{cases}
\]
Crudely, $r$ could be considered to be the maximum
change in success probability that one would expect
given that ESP exists. Also, these distributions are
the "extreme points" over the class of symmetric
unimodal conditional densities, so answers that hold
over this class are also representative of answers
over a much larger class. Note that here $r \le 0.25$
(because $0 \le \theta \le 1$); for the given data the $\theta > 0.5$
are essentially irrelevant, but if it were deemed
important to take them into account one could use
the more sophisticated binomial analysis in Berger
and Delampady (1987).

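For $g_r$, the Bayes factor of $H_1$ to $H_0$, which is to
be interpreted as the relative odds for the
hypotheses provided by the data, is given by

\[
B(r) = \frac{\dfrac{1}{2r} \displaystyle\int_{0.25 - r}^{0.25 + r}
         \theta^{122} (1 - \theta)^{355 - 122} \, d\theta}
        {(1/4)^{122} (1 - 1/4)^{355 - 122}}
  \approx \frac{1}{2r} \, (63.13)
    \left[ \Phi\!\left( \frac{r - 0.0937}{0.0252} \right)
         + \Phi\!\left( \frac{r + 0.0937}{0.0252} \right) - 1 \right].
\]

This is graphed in Figure 1.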
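
The shape of this curve can be explored with a short sketch of the normal-approximation form. The assumptions here are ours and should be read as such: the observed proportion $122/355$ is treated as normal with standard error of about 0.0252, the distance from the null is about 0.0937, and the leading constant is taken to be the standard error divided by the standard normal density at the standardized distance, which gives a value near 63. The function name `bayes_factor_normal_approx` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bayes_factor_normal_approx(r, successes=122, trials=355, theta0=0.25):
    """Approximate Bayes factor of H1 to H0 for a uniform prior on
    (theta0 - r, theta0 + r), using a normal approximation."""
    theta_hat = successes / trials
    se = sqrt(theta_hat * (1 - theta_hat) / trials)  # about 0.0252
    d = theta_hat - theta0                           # about 0.0937
    const = se / STD_NORMAL.pdf(d / se)              # assumed origin of the constant
    bracket = STD_NORMAL.cdf((r - d) / se) + STD_NORMAL.cdf((r + d) / se) - 1.0
    return const * bracket / (2.0 * r)


# Evaluate on a grid of r values in (0, 0.25].
grid = [k / 1000 for k in range(5, 251)]
values = [bayes_factor_normal_approx(r) for r in grid]
best_r = grid[values.index(max(values))]

print(best_r, max(values))
print(bayes_factor_normal_approx(0.25))
```

Under these assumptions the curve rises from small values for small $r$, peaks for $r$ somewhat larger than the observed excess over 1/4, and then declines as the prior spreads over values of $\theta$ that the data do not support.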

The P-value for this problem was 0.00005,
indicating overwhelming evidence against $H_0$ from a
classical perspective. In contrast to the situation
studied by Jefferys (1990), the Bayes factor here
does not completely reverse the conclusion,
showing that there are very reasonable values of $r$ for
which the evidence against $H_0$ is moderately
strong, for example 100/1 or 200/1. Of course, this
evidence is probably not of sufficient strength to
overcome strong prior opinions against $H_0$ (one
obtains final posterior odds by multiplying prior
odds by the Bayes factor). To properly assess
strength of evidence, we feel that such Bayes factor
computations should become standard in
parapsychology.

Fig. 1. The Bayes factor of $H_1$ to $H_0$ as a function of $r$, the
maximum change in success probability that is expected given
that ESP exists, for the ganzfeld experiment.

As mentioned by Professor Utts, Bayesian meth- 
ods have additional potential in situations such as 
this, by allowing unrealistic models of iid trials to 
be replaced by hierarchical models reflecting differ- 
ing abilities among subjects. 
This paper offers readers interested in statistical
science multiple views of the controversial history
of parapsychology and how statistics has
contributed to its development. It first provides an
account of how both design and inferential aspects
of statistics have been pivotal issues in evaluating 
the outcomes of experiments that study psi abili- 
ties. It then emphasizes how the idea of science as 
replication has been key in this field in which 
results have not been conclusive or consistent and 
thus meta-analysis has been at the heart of the 
literature in parapsychology. The author not only 
reviews past debate on how to interpret repeated 
psi studies, but also provides very detailed informa- 
tion on the Honorton-Hyman argument, a nice 
illustration of the challenges of resolving such de- 